Abstract 

Content integration in performance assessment involves mixing different areas of 
knowledge in one assessment. In this type of testing situation, assessment tasks are 
designed to measure students' ability to solve problems by applying their 
knowledge and skills in multiple content areas. This study examined the effect of 
integrated science and reading items on the reliability and construct validity of the 
reading assessment for the Maryland School Performance Assessment Program (MSPAP). 
Using the MSPAP 1998 reading data, this study demonstrated that the integrated 
science-reading items provide reliable and valid information about students’ reading 
abilities that is comparable to that provided by the non-integrated items. The results 
suggested that the integrated science-reading items did not compromise the integrity of 
the reading scales in question. There is evidence that the scales are essentially 
unidimensional. However, there are minor task-related factors that warrant further 
analysis of the latent structure of the reading scales. This finding is complemented by 
the high local item dependency among tasks. 

Perspectives 

In recent years several states have embraced content area integration as a 
preferred method for delivering curricula. This process involves the creation of 
instructional activities that combine skills and knowledge from multiple content areas 
such as reading and writing, reading and science, and mathematics and science. The 
integrated activities are expected to foster the development of general cognitive processes 
that may be useful in dealing with these problems (Herman et al., 1992). Along with the 
use of integrated instructional activities, many states have relied on integrated tasks as 
part of their statewide assessment programs (MSDE, 1999; NYSED, 1998). Integrated 
tasks are also used in comparative studies of mathematics and science achievement across 
countries (Harmon et al., 1997). 

In performance assessment, content integration typically involves establishing a 
real-life context with a sense of purpose and intended audience (MSDE, 1999). The 
assessment tasks then require examinees to apply skills and knowledge from multiple 
content areas in order to solve real-life problems. (We will refer to these as integrated 
tasks in the remainder of this paper.) There are, of course, assessment tasks (non- 
integrated tasks) that are tied to only one content area. 

For instructional purposes, test scores are needed for each separate content area 
(such as reading). To fulfill this need, item responses to integrated and non-integrated 
assessment tasks are often calibrated together in order to create the needed scale. There 
are a number of issues relating to the use of integrated and non-integrated assessment 
tasks (or items) in creating a scale for a given content area. One such issue relates to the 
nature of the construct assessed by the combined items. The strong person-by-task 
interaction often found in performance assessment (Shavelson et al., 1997) may imply that 
tasks involving more than one content area assess a construct that is different from 
the construct tapped by items that involve a single content area. The interaction may also 
imply that tasks that involve different combinations of content integration assess 
different constructs. For example, reading tasks integrated with writing may measure 
a different construct than reading tasks integrated with science. 

Using data from the 1994 Maryland School Performance Assessment Program (MSPAP), 
Ercikan et al. (1997) examined the effect of mathematics and science integration on the 
validity and reliability of the separate mathematics and science scales. Their results 
indicate that the integrated mathematics and science items do not compromise the 
integrity of the separate scales. The contributions to test reliability of the integrated and 
non-integrated items are similar. The confirmatory factor analysis suggests that the 
integrated and non-integrated items measure the same construct. 

The results obtained by Ercikan and her associates are based on science and 
mathematics. It is well known that these two content areas are quite similar to each other 
in terms of cognitive processes; so it may not be surprising that the integrated and non- 
integrated items measure similar constructs. It may be of considerable interest to know if 
construct similarity still holds when integrating assessment tasks from (seemingly unrelated) 
areas such as reading, science, mathematics and social studies. 

Objectives 

This study focuses on the effect of content integration on the construct validity of 
a reading assessment that is integrated with science. Data from the Maryland School 
Performance Assessment Program (MSPAP) are used to answer two fundamental 
questions: 

(1) Do reading assessment tasks that are integrated with science assess different 
constructs from non-integrated reading tasks? 

(2) Is it meaningful to combine the scores on integrated and non-integrated items 
to yield a single reported total score? 

The similarity of constructs measured by integrated and non-integrated items was 
examined by item-factor analysis and item-test correlation. The effect of calibrating 
integrated and non-integrated items together to yield a common scale was examined in 
terms of item fit and local item dependency. The validity implications of scaling both item 
types together and deriving scale scores from the common scale are discussed. 

Data 

This study uses reading assessment data for grade 8 from the 1998 MSPAP 
assessment. MSPAP is a statewide performance assessment for students in grades 3, 5, and 8. 
Three non-equivalent forms of the test are administered to randomly equivalent groups of 
students at each school, with six to eight tasks per form. Some tasks are single content 
area tasks (non-integrated tasks) and others are integrated (integrated tasks). There are 
two reading tasks in each of the two forms we used: one non-integrated reading task and 
one reading-science integrated task. A caveat must be made about these tasks. In two of 
the tasks, one writing question was integrated at the end of the task. Strictly speaking, 
the effect of integrating writing within a reading task could also be considered. However, 
since the only purpose of integrating writing items at the end of some MSPAP tasks is to 
provide a rich context in which writing can be assessed, its potential effect on reading 
assessment is assumed to be minimal and is therefore not considered in this study. 

A random sample of 7,502 students was extracted from the statewide 
data file for each form. Only the grade 8 data were chosen in this study because 
instructional content integration in the middle school (grade 8) is not as prevalent as in 
the primary school (grades 3 and 5). All MSPAP items are constructed response and are 
scored by Maryland teachers using activity-specific scoring tools. Item Response 
Theory (IRT) procedures were used to calibrate all reading items together in each of the 
three forms. The scaling procedures were implemented using the two-parameter partial 
credit model (Yen, 1993; Yen & Ferrara, 1997). 
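
As a rough illustration of the kind of model used for calibration, the sketch below computes score-level probabilities and the expected item score under a two-parameter partial credit parameterization. The parameterization (one discrimination `a` and step parameters `steps`), the function names, and all numeric values are assumptions chosen for illustration; they are not MSPAP item parameters, and the exact parameterization in Yen (1993) may differ.

```python
import math

def category_probabilities(theta, a, steps):
    """Probability of each score level 0..m for one item.

    theta : examinee trait value
    a     : item discrimination (hypothetical value)
    steps : step parameters b_1..b_m (hypothetical values)
    """
    # Cumulative logits: z_0 = 0, z_k = z_{k-1} + a * (theta - b_k)
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, a, steps):
    """Model-implied expected item score at a given trait value."""
    probs = category_probabilities(theta, a, steps)
    return sum(k * p for k, p in enumerate(probs))

if __name__ == "__main__":
    # A hypothetical item with three score levels (0, 1, 2)
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(expected_score(theta, a=1.2, steps=[-0.5, 0.8]), 3))
```

The expected score computed this way is the quantity that is later compared with the observed score when item fit and local item dependency are evaluated.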

Method 

Item Factor Analysis 

The similarity of constructs measured by integrated and non-integrated items was 
first examined by item-factor analysis. Unidimensionality is an assumption of the item 
response theory (IRT) model used to calibrate MSPAP items. The unidimensionality 
assumption is seldom strictly met since there is always a possibility that other factors 
are present that can affect test performance. The purpose of the item-factor analysis is to 
examine whether the test is ‘essentially unidimensional’ (Stout, 1990). 

We consider data obtained from a random sample of 1,000 students for each form, 
drawn from among the 7,502 students. A sample size of 1,000 was chosen because it is 
an adequate sample size for obtaining the asymptotic distribution free estimators used in 
this study (Bentler, 1995). Principal component analysis was first applied to examine the 
number of principal components that need to be retained based on the 
eigenvalue-greater-than-one rule. Principal factor analysis was then conducted, with each 
item's maximum correlation with any other item used as the communality estimate on the 
diagonal. The factors were rotated toward the hypothesis matrix by an oblique promax solution. 

To factor analyze mathematics and science integrated MSPAP tests, Ercikan et al. 
(1997) fit a factor model in which the integrated and non-integrated items form different 
but correlated factors. Following that approach, it is hypothesized that items associated 
with the non-integrated reading task form a different factor than those items associated with the 
integrated reading task. Such a model may be interpreted as the single-level realization of 
a hierarchical factor model, in which at a second order level, correlated first-order factors 
depend on a single underlying general factor (Bentler, 1995). 

We also consider an alternative model: a bi-factor model that includes a general 
factor for all the items, plus two specific factors, one for each task (Schmid & Leiman, 1957). 
Unlike the hierarchical factor model, items are allowed to load directly on the general 
dimension. This model is plausible for performance assessment tests where the primary 
dimension measures the targeted process skill and additional factors describe content area 
knowledge within tasks. In this model, items would be conditionally independent between 
tasks, but conditionally dependent within a task (Holzinger & Swineford, 1937). 
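
The two competing structures can be written compactly as follows. The notation is introduced here for illustration only and is an assumption, not the authors' own: x_i is the standardized score on item i, t(i) is the task to which item i belongs, and the specific factors in the bi-factor form are taken to be uncorrelated with the general factor and with each other, as is conventional.

```latex
% Correlated-factors (hierarchical) form: each item loads only on its task factor
x_i = \lambda_i \, \xi_{t(i)} + \varepsilon_i ,
\qquad \operatorname{Corr}(\xi_1, \xi_2) = \phi

% Bi-factor form: each item loads on a general factor and on its task-specific factor
x_i = \lambda_i^{(g)} \, \xi_g + \lambda_i^{(s)} \, \xi_{t(i)} + \varepsilon_i ,
\qquad \operatorname{Cov}(\xi_g, \xi_1) = \operatorname{Cov}(\xi_g, \xi_2) = \operatorname{Cov}(\xi_1, \xi_2) = 0
```

Under the bi-factor form, two items from different tasks covary only through the general factor, whereas two items from the same task also share a task-specific factor. This is the sense in which items are conditionally independent between tasks but conditionally dependent within a task.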

Confirmatory factor analysis was conducted using EQS (Bentler, 1998). For 
ordered categorical data that are not themselves normally distributed, but may be 
assumed to reflect underlying variables that are normally distributed, it is recommended 
that polychoric correlations be used to represent the population correlation matrix 
(Gorsuch, 1983). However, estimation of polychoric correlations requires a 
large sample, and its accuracy may be problematic for small samples, because the 
categorical data methods require the cross-tabulation tables of the categorical variables. 
These tables will be sparse if there are too few subjects in some of the cells, and the 
computational procedures can then break down (Bentler, 1995). Therefore, we chose to use 
the variance-covariance matrix for the estimation after confirming that the 
multivariate normality assumption seems to hold. The standardized skewness and 
kurtosis were examined for univariate normality. Mardia's coefficient, provided in EQS, 
measures the degree to which the assumption of multivariate normality has been 
violated. Its normalized estimate is distributed as a unit normal variate. A large positive 
value indicates significant positive kurtosis, and a large negative value indicates 
significant negative kurtosis (Bentler, 1995). 

We consider the fully iterated arbitrary generalized least squares estimator 
provided by EQS, which is Browne's asymptotic distribution free (ADF) estimator 
(Browne, 1982, 1984). The generalized least squares method has been shown to be 
robust to violations of normal theory. The ADF estimation procedure does not assume 
multivariate normality of the response data. Rather, the assumption of multivariate 
normality is replaced by elliptical theory, which allows symmetrically distributed variables 
to have either heavier or lighter tails than their normal theory counterparts, but the 
kurtoses of the variables are assumed to be equal (Browne, 1982, 1984; Bentler, 1995). 

In conducting item factor analysis with a sizable number of items, most response 
patterns are realized only once, and expected frequencies are near zero. In this case, the 
usual chi-square approximation for the distribution of multinomial goodness of fit statistics 
is inaccurate. Haberman (1977) has shown, however, that the difference in fit statistics for 
alternative models is distributed in large samples as chi-square, with degrees of 
freedom equal to the difference in the numbers of degrees of freedom of the alternative 
models, even when the frequency table is sparse. These differences will be used in the 
comparisons of alternative models. 

Item-test correlation 

The item-test correlations are indicators of the strength of the relationship 
between what is being measured by the item and the overall construct measured by the 
set of items in the test. If the integrated items assess a different construct than the non- 
integrated items, they would be expected to correlate lower with the total test score than 
the non-integrated items. 
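
A minimal sketch of this computation follows. It assumes the uncorrected form of the statistic, in which the item itself is included in the total score (the text does not say whether a corrected total was used), and the small score matrix is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_test_correlations(scores):
    """Correlate each item's scores with the total test score.

    scores: list of examinee rows, each a list of item scores.
    """
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson([row[j] for row in scores], totals) for j in range(n_items)]

if __name__ == "__main__":
    # Hypothetical data: six examinees, three polytomous items
    scores = [[0, 1, 0], [1, 1, 0], [1, 2, 1], [2, 2, 1], [2, 3, 2], [3, 3, 2]]
    print([round(r, 2) for r in item_test_correlations(scores)])
```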

Item Fit and Local Item Independence 

The effects of content integration were also examined in terms of item fit and 
measures of local dependency. If the integrated items assess a different construct than 
non-integrated items, these items may be expected to fit the model poorly. Two different 
types of model fit analyses were used. The first type measures the fit of item 
responses to the model. It was evaluated by a generalization of the Q1 statistic to the 
two-parameter partial credit model (Yen, 1981). The second type of fit analysis examined 
pairwise local item dependence among the items by a generalization of the Q3 statistic to 
the two-parameter partial credit model (Yen, 1993). 

Yen's Q1 statistic compares the observed and predicted trace lines. To 
calculate this statistic, a trait estimate is obtained for each student using the student’s 
responses to all the items in the scale. Using the trait estimate and the item parameters 
estimated from the two-parameter partial credit model, the student’s expected performance 
on each item is estimated. The deviation between the observed and predicted 
performance is calculated and is initially referenced to a chi-square distribution. A 
mathematical conversion is then applied to reference the chi-square statistic to a Z-score 
distribution for ease of interpretation (Yen, 1993). 
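
The general shape of such a statistic can be sketched as below. This is an illustration under stated assumptions, not the operational MSPAP procedure: examinees are assumed to have been grouped into trait-ordered cells, the statistic is written as a Pearson-type sum over cells and score levels, and the Z conversion shown (subtracting the degrees of freedom and dividing by the square root of twice the degrees of freedom) is one common standardization assumed here because the text does not spell out the conversion. The cell data and the degrees of freedom are hypothetical.

```python
import math

def q1_statistic(cells):
    """Pearson-type fit statistic summed over trait-ordered cells.

    cells: list of (n, observed, expected) tuples, where `observed` and
    `expected` are lists of proportions over the item's score levels and
    n is the number of examinees in the cell.
    """
    q1 = 0.0
    for n, observed, expected in cells:
        for o, e in zip(observed, expected):
            q1 += n * (o - e) ** 2 / e
    return q1

def q1_to_z(q1, df):
    """Standardize a chi-square value using mean df and variance 2*df (assumed)."""
    return (q1 - df) / math.sqrt(2 * df)

if __name__ == "__main__":
    # Hypothetical cells for an item with three score levels
    cells = [
        (100, [0.60, 0.30, 0.10], [0.55, 0.33, 0.12]),
        (100, [0.30, 0.45, 0.25], [0.33, 0.42, 0.25]),
        (100, [0.10, 0.35, 0.55], [0.12, 0.36, 0.52]),
    ]
    q1 = q1_statistic(cells)
    # df is passed in explicitly; the value 3 is a placeholder for this toy example
    print(round(q1, 3), round(q1_to_z(q1, df=3), 3))
```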

A well-known problem with using a chi-square statistic as a measure of model fit is that, 
for large sample sizes, small deviations from model predictions can yield large chi-square 
values of no practical importance. Rules of thumb have been developed to yield cut 
scores on Z for flagging items whose misfit is of practical significance. For the sample size 
of about 7,500, items with Z greater than 19 were flagged as poor fit (MSDE, 1992). 

Local item dependency (LID) violates a fundamental assumption of IRT. Writers 
such as Sireci, Thissen, and Wainer (1991) have documented local item 
dependency in reading comprehension tests. They found evidence of LID among reading 
comprehension items linked to the same reading passage. 

If item factor analysis suggests that some residual correlations are left 
unexplained when fitting a one-factor model, the items involved may define a second or 
higher factor. They are said to be locally dependent in IRT terminology. In this situation, 
fitting an IRT model that assumes unidimensionality may not be appropriate. Therefore, it 
is necessary to examine the extent and the nature of the LID. Doing so helps address the 
question of whether examinees' performance on both integrated and non-integrated items 
can be differentiated in a reliable way. It may also have implications 
for the validity of generalizing scores across integrated and non-integrated tasks (Yen, 
1993). 

The statistical procedure in identifying pairs of locally dependent items is based 
on a method proposed by Yen (1993). This method has been shown to yield similar 
results to other statistical methods for detecting LID (Ferrara, Huynh, & Michaels, 1996). 
To implement this procedure, the deviation between observed and predicted item 
performance was obtained for each examinee. A trait estimate is obtained for each 
student using the student’s responses to all items in the scale. Using the trait estimate and 
the item parameters, the student’s expected performance on each item is determined, and 
the deviation between the student’s observed and expected item performance is calculated. 
The measure of LID between two given items is the correlation of these deviations taken 
over students. 

In order to have a context in which to evaluate the size of the Q3 statistic, it should be 
noted that in calculating this statistic, the item score is included in both the estimation of 
expected performance and the observed performance (Yen, 1993). Therefore there is a 
known negative bias in Yen’s Q3 statistic. When local independence holds, the 
expected value of Q3 is approximately -1/(n-1), where n is the number of items. In this 
case, the expected value of the statistic is -0.08 for a 13-item test. We used |Q3| > .20 as 
the criterion for flagging significant LID for further examination. 
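
The residual-correlation procedure and the flagging rule can be sketched as follows. The function names and the toy data are hypothetical, items are indexed from 0 in the code, and the expected scores are assumed to have already been computed from the fitted IRT model.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def q3_statistics(observed, expected):
    """Correlate observed-minus-expected residuals for every item pair.

    observed, expected: examinee-by-item lists of scores.
    Returns a dict mapping (i, j) item-index pairs to Q3 values.
    """
    residuals = [[o - e for o, e in zip(o_row, e_row)]
                 for o_row, e_row in zip(observed, expected)]
    n_items = len(observed[0])
    return {(i, j): pearson([r[i] for r in residuals], [r[j] for r in residuals])
            for i, j in combinations(range(n_items), 2)}

def flag_lid(q3, cutoff=0.20):
    """Keep the item pairs whose |Q3| exceeds the cutoff used in this study."""
    return {pair: value for pair, value in q3.items() if abs(value) > cutoff}

def expected_q3_under_independence(n_items):
    """Approximate expected Q3 when local independence holds: -1/(n-1)."""
    return -1.0 / (n_items - 1)

if __name__ == "__main__":
    print(round(expected_q3_under_independence(13), 2))
```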



Analysis and Results 

Item Factor Analysis 

The principal component analysis showed that the two correlation matrices are 
both positive-definite. Two components were retained based on the 
eigenvalue-greater-than-one rule for both forms. For form A, the two largest eigenvalues 
are 4.52 and 1.51. The variance explained by the first and second factor was 38% and 13%, 
respectively. Together they account for 51% of the standardized variance. For item factor 
analysis, this amount is considered acceptable because items by themselves are not very 
reliable. For form B, the two largest eigenvalues are 5.35 and 1.58. The variance explained 
by the first and second factor was 41% and 12%, for a total of 53%. For both forms, there 
seems to be a dominating factor underlying the data. 

Principal factor analysis was then conducted, with each item's maximum correlation with 
any other item used as the communality estimate on the diagonal. The first two factors 
were rotated toward the hypothesis matrix by an oblique promax solution. The correlation 
between the two factors is 0.55 for form A and 0.62 for form B. For both forms, the 
factor pattern shows that all items related to the non-integrated task have high and positive 
loadings on one factor while the science/reading integrated items have high and 
positive loadings on the other factor. 

The plot of the factor pattern, based on the oblique rotation, revealed that the reference 
axes go through two clusters of items for both forms. Since the Harris-Kaiser rotation 
did not improve the alignment, it was concluded that this is the best rotation 
attainable. Inspection of the plot reveals that each factor is dominated by items from a 
particular task. In other words, all the non-integrated reading items are well aligned on 
one principal axis, and all the reading and science integrated items are aligned on the other 
axis. 

Confirmatory factor analysis was conducted using EQS (Bentler, 1998). Due to 
the categorical nature of some of the items examined, both the univariate and multivariate 
normality assumptions were carefully examined. For form A, no items exhibit extreme 
skewness or excessive kurtosis. The normalized Mardia's coefficient estimate (.46) was 
not significantly different from zero. The assumption of multivariate normality appears 
not to have been violated. For form B, no items exhibit extreme skewness or excessive 
kurtosis either. However, the normalized Mardia's coefficient estimate (1.37) indicates 
that the multivariate kurtosis may be somewhat elevated, although not to a statistically 
significant degree. 

The likelihood ratio tests of significance for the one-factor, hierarchical, and 
bi-factor models reported in Table 1 are all significant. The difference in fit statistics for 
alternative models is distributed in large samples as chi-square, with degrees of 
freedom equal to the difference in the numbers of degrees of freedom of the alternative 
models (Haberman, 1977). These differences are used in the comparisons of 
alternative models. The one-factor model has the worst fit among the three models. The 
hierarchical model provided a significant improvement in fit over the one-factor model 
(G^2 = 188 for form A, G^2 = 191 for form B, df = 1, p < .00001). The bi-factor model 
provided a significant improvement in fit over the hierarchical model (G^2 = 56 for form 
A, G^2 = 164 for form B, df = 14, p < .00001), suggesting that a general reading dimension 
plus two task-specific dimensions are required to fully describe the data. The ratio of 
chi-square to degrees of freedom (G^2/df) and the Comparative Fit Index (CFIT) also 
suggest that the bi-factor model is the best fitting model among the three compared. The 
AIC index, however, suggests that for form A, the bi-factor model may be overfitting and 
the hierarchical model is a better fitting model. This could be due to the AIC index’s 
tendency to reward more parsimonious models. 
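
The nested-model comparisons above can be reproduced from the G^2 and df values reported in Table 1. The sketch below is an illustration rather than the authors' code: the helper names are hypothetical, and the chi-square tail probability is computed with a closed-form expression that is valid for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum of half-integer powers
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total

def difference_test(g2_restricted, df_restricted, g2_general, df_general):
    """Difference in G^2 between nested models, its df, and the p-value."""
    diff = g2_restricted - g2_general
    ddf = df_restricted - df_general
    return diff, ddf, chi2_sf(diff, ddf)

if __name__ == "__main__":
    # (G^2, df) pairs taken from Table 1
    table1 = {
        "A": {"one": (334.56, 65), "hier": (146.44, 64), "bi": (89.95, 50)},
        "B": {"one": (503.51, 65), "hier": (312.12, 64), "bi": (148.03, 50)},
    }
    for form, m in table1.items():
        print(form, "one-factor vs hierarchical:", difference_test(*m["one"], *m["hier"]))
        print(form, "hierarchical vs bi-factor:", difference_test(*m["hier"], *m["bi"]))
```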



Table 1

Confirmatory Factor Analysis

Model           G^2       df    G^2/df   CFIT   AIC

Form A
One Factor      334.56    65    5.15     .50    13.16
Hierarchical    146.44    64    2.29     .88    -349.63
Bi-Factor       89.95     50    1.80     .94    -271.11

Form B
One Factor      503.51    65    7.75     .60    -160.62
Hierarchical    312.12    64    4.88     .72    -178.46
Bi-Factor       148.03    50    2.96     .89    -206.46



There is significant residual variation from a one-factor model. For form A, the 
average absolute standardized residual from the one-factor model is 0.08. However, 
most of the standardized residuals are relatively small. Seventy-two percent of the 
standardized residuals have absolute values less than .10. The largest standardized 
residual (.26) is accounted for by the residual correlation between items 6 and 7. The 
implication is that when the data are calibrated using IRT models that assume 
unidimensionality, items with high residuals may be locally dependent on other items, as 
discussed later. With the bi-factor model, the average absolute standardized residual is 
reduced to .03 and none of the standardized residuals are greater than .10. 

Similar results were found for form B. The average absolute standardized residual 
from the one-factor model is very small (.05), and 67% of the standardized 
residuals have absolute values less than .10. The largest standardized residual (.26) is 
accounted for by the residual correlation between items 4 and 5. With the bi-factor model, 
the average absolute standardized residual is reduced to .05 and 80% of the 
standardized residuals are less than .10. However, the standardized residual correlation 
for items 4 and 5 remains high (.20). Although fitting a more complicated model 
decreased the overall standardized residuals, there is still relatively large covariation 
between these two items that is not accounted for by the bi-factor model. 

The results from the item factor analysis seem to indicate that the test is 
essentially unidimensional (Stout, 1990). The first latent root accounts for 38% and 41% of 
the variance for forms A and B, respectively, while the rest explain very little of the variance. 
It is clear that the vast preponderance of variance and covariance is associated with the 
first factor. 

Although the bi-factor model did not fit the data according to the fit statistic, it does 
indicate that there is a general latent dimension and two minor task-related dimensions. 
The existence of a dominating factor is supported by the general factor. There are clearly 
task-related factors. However, whether the minor factors account for much of the observed 
covariance among the items requires further investigation. These two task-related 
factors appear to reflect content-based local dependence (Yen, 1993) among the items. 

Item-test Correlation 

Cronbach’s coefficient alpha, proportion correct, item-test correlations, and Yen’s 
Q1 statistic are reported in Table 2 for both forms. Coefficient alpha suggests that the 
reliability of both forms is very good, given that there are only 13 items in each form. 
The proportion correct was calculated for each item by dividing the students’ average score 
on the item by the maximum possible score on the item. For form A, integrated items 
seem to be more difficult (average proportion correct = .40) than the single content area 
items (average proportion correct = .59). Interestingly, the item-test correlations suggest 
that the integrated items have a stronger relationship with the test than the non-integrated 
items. For form B, the integrated items are, on average, more difficult than the 
non-integrated items; however, the item-test correlations for the non-integrated items seem 
to be higher than those of the integrated items. The item-test correlations are affected by 
the number of score levels: items with more score levels tend to have higher item-test 
correlations. This could explain why different results were obtained for the two forms. 
However, the overall results suggest that the integrated items measure more or less the 
same construct as the non-integrated items. 
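
The two summary statistics described above can be sketched as follows. The score matrix and maximum scores are hypothetical, and the sample-variance form of coefficient alpha is assumed.

```python
def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(scores):
    """Coefficient alpha from an examinee-by-item score matrix."""
    k = len(scores[0])
    item_variances = [sample_variance([row[j] for row in scores]) for j in range(k)]
    total_variance = sample_variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def proportion_correct(scores, max_scores):
    """Average item score divided by the item's maximum possible score."""
    n = len(scores)
    return [sum(row[j] for row in scores) / n / max_scores[j]
            for j in range(len(max_scores))]

if __name__ == "__main__":
    # Hypothetical data: six examinees, three items with maximum scores 3, 3, 2
    scores = [[0, 1, 0], [1, 1, 0], [1, 2, 1], [2, 2, 1], [2, 3, 2], [3, 3, 2]]
    print(round(cronbach_alpha(scores), 2))
    print([round(p, 2) for p in proportion_correct(scores, [3, 3, 2])])
```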

Item Fit 

Yen’s Q1 statistics are reported in the last column of Table 2. For both forms, all 
7,502 students’ responses were included in the computation. The fit statistics suggest 
that the integrated items fit the two-parameter partial credit model as well as the 
non-integrated items. Only item 13 in form B was flagged as a poor fit using the Z > 19 
criterion discussed earlier. Item 11 in form B also showed a high fit statistic, although it 
was not flagged. Both items are associated with the non-integrated reading task in form 
B. Careful examination of the expected and observed trace lines seemed warranted to 
explore further where the misfit occurred. Deviations between observed and expected 
frequencies for each decile indicate that the misfit is due to sparse data at the low end of 
the ability distribution for both items, a problem that is not unusual in fitting 
performance assessment items to an IRT model. 

Table 2

Proportion Correct, Item-Test Correlations, and Item Fit Statistics

Task             Item   Score    Proportion   Item-Test     Yen's Q1
                        Levels   Correct      Correlation

Form A (Cronbach’s alpha = .86)
Integrated       1      3        .74          .56           3.73
                 2      3        .41          .63           10.60
                 3      4        .47          .62           6.86
                 4      4        .26          .66           10.34
                 5      3        .36          .63           5.86
                 6      4        .24          .65           5.68
                 7      4        .34          .65           13.98
Non-integrated   8      3        .45          .54           2.11
                 9      3        .52          .62           2.94
                 10     3        .73          .54           9.88
                 11     3        .62          .62           7.01
                 12     3        .54          .59           2.21
                 13     3        .65          .61           3.60

Form B (Cronbach’s alpha = .88)
Integrated       1      3        .77          .61           5.80
                 2      2        .72          .50           7.71
                 3      4        .59          .62           16.85
                 4      3        .45          .66           9.31
                 5      3        .35          .63           4.47
                 6      3        .55          .67           3.96
                 7      3        .33          .56           6.71
Non-integrated   8      3        .69          .65           9.03
                 9      4        .56          .72           9.03
                 10     3        .51          .63           6.04
                 11     4        .42          .69           18.60
                 12     3        .63          .69           10.62
                 13     3        .61          .67           24.56

Note. Sample size is 7,502 for both forms.

Local Item Dependency 

If items were locally dependent, then there would be nonzero residual correlations 
after accounting for the first factor. Based on the results from item factor analysis, it 
seems that some residual correlations are left unexplained under a one-factor 
model. Therefore, we should expect some items to exhibit local dependency when 
the items are scaled using item response models that assume unidimensionality. The issue 
to explore here was whether the items showed LID due to commonality within a task or 
across tasks. 

To investigate this issue, the Q3 statistics were separated into within-task and 
across-tasks sets. For an n-item test, there are n(n-1)/2 Q3 statistics, one for each item 
pair. The within-task LID was further sorted by task; the two tasks have 7 and 6 items, 
respectively. Therefore, there are 21 within-integrated-task Q3 statistics and 15 
within-non-integrated-task Q3 statistics to examine. Across-tasks LID involves examining 
the dependency between the two sets of items, one from each task, resulting in 42 (7 x 6) 
Q3 statistics. The frequencies of Q3 values are summarized in Table 3. 
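
The pair counts can be checked by enumerating the item pairs directly. The sketch below uses the item numbering from Table 2 (items 1 to 7 integrated, items 8 to 13 non-integrated); the function name is hypothetical.

```python
from itertools import combinations, product

def partition_pairs(task_a, task_b):
    """Split all item pairs into within-task and across-tasks sets."""
    within_a = list(combinations(task_a, 2))
    within_b = list(combinations(task_b, 2))
    across = list(product(task_a, task_b))
    return within_a, within_b, across

if __name__ == "__main__":
    integrated = list(range(1, 8))        # items 1-7
    non_integrated = list(range(8, 14))   # items 8-13
    within_int, within_non, across = partition_pairs(integrated, non_integrated)
    print(len(within_int), len(within_non), len(across))
```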

The criterion used for flagging significant LID is |Q3| > .20. The stem-and-leaf 
frequency display shows that only one item pair (items 8 and 9) in form A exhibits 
high and positive within-task LID. These two items were organized in steps: knowing the 
answer to item 8 increases a student’s chance of getting a high score on item 9. This 
seems to reflect what Yen (1993) refers to as the item-chaining effect. Two item pairs 
(items 4 and 5, and items 12 and 13) had high and positive within-task LID in form B. They 
also reflect the item-chaining effect. Positive LID means that if a student performs higher 
than expected on one item, he or she will probably perform higher than expected on 
another item, and vice versa. 



Table 3

Frequency of the Within-Task and Across-Tasks Q3 Statistic

Form A

             Within-Task LID                Across-Tasks LID
Q3           Integrated    Non-Integrated
-0.2                                        17
-0.1         1                              25
-0.0         13
 0.0         4             8
 0.1         2             6
 0.2                       1
Sum          21            15               42

Form B

             Within-Task LID                Across-Tasks LID
Q3           Integrated    Non-Integrated
-0.3                                        1
-0.2                                        16
-0.1                                        25
-0.0                       8
 0.0         13            5
 0.1         7             1
 0.2                       1
 0.3
 0.4
 0.5         1
Sum          21            15               42
Cross-Task LID occurred more frequently than Within-Task LID and merits further
investigation. This result seems to contradict the expectation that within-passage LID
would be high when items are tied to a particular passage (Holzinger & Swineford,
1937). Table 4 reports the seventeen item pairs flagged as having a large negative Q3
statistic on each form. Negative LID means that if a student performs lower than
expected on one item, he or she will probably perform higher than expected on the other
item, and vice versa. An asterisk (*) indicates that LID was identified for the item pair.
These pairs seem to reflect what Yen (1993) refers to as Content-Related LID. For Form A,
the science and reading integrated task measures the extent to which the examinees can
extract scientific information from the text (Reading for Information). The
non-integrated reading task, in contrast, requires the examinees to identify and explain why
one particular reading passage was better in helping students perform an investigation
(Reading to Perform). For Form B, the integrated task measures whether the students can
follow directions in text to perform a scientific experiment (Reading to Perform). The
non-integrated task, on the other hand, asks students to discuss a poem that they read on
the test (Reading for Literary Experience).
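
As a minimal sketch of how the Q3 statistic discussed above can be computed (Q3 for an item pair is the correlation, across examinees, of the residuals between observed and model-expected item scores), the Python below assumes that expected scores are already available from an IRT calibration. The function names, the data layout, and the 0.2 screening threshold are illustrative assumptions and are not taken from the MSPAP analysis.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # undefined if either sequence is constant

def q3_matrix(observed, expected):
    """Q3 for every item pair.

    observed[k][i] is examinee k's score on item i; expected[k][i] is the
    model-implied score at that examinee's estimated ability (assumed to
    come from a prior IRT calibration).
    """
    n_items = len(observed[0])
    # One list of residuals (observed minus expected) per item.
    resid = [[obs[i] - exp[i] for obs, exp in zip(observed, expected)]
             for i in range(n_items)]
    return [[pearson(resid[i], resid[j]) for j in range(n_items)]
            for i in range(n_items)]

def flag_pairs(q3, threshold=0.2):
    """Item pairs (i, j, Q3) with |Q3| at or above an assumed threshold."""
    n_items = len(q3)
    return [(i, j, q3[i][j])
            for i in range(n_items) for j in range(i + 1, n_items)
            if abs(q3[i][j]) >= threshold]
```

A negative entry in the resulting matrix corresponds to the situation described above: examinees who score above expectation on one item of the pair tend to score below expectation on the other.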
Table 4

Across-Tasks LID: Item Pairs with Maximum Q3

Form A: Integrated items 1-7 (Reading for Information) by Non-integrated items 8-13 (Reading to Perform)

  Non-integrated item    Flagged pairs with integrated items 1-7 (one * per pair)
          8              * *
          9              * * * *
         10
         11              * * * *
         12              * * *
         13              * * * *

Form B: Integrated items 1-7 (Reading to Perform) by Non-integrated items 8-13 (Literary Experience)

  Non-integrated item    Flagged pairs with integrated items 1-7 (one * per pair)
          8              * *
          9              * * * * * *
         10              * *
         11              * *
         12              * *
         13              * * *

Conclusions and Discussions

Performance assessment differs in many respects from traditional multiple-choice
items. There is a need to cover a broad range of content in an authentic, real-life
context. Rather than striving to create independent items, as is deliberately done when
constructing multiple-choice items, performance assessment engages students in a series
of items related to the same topic. To meet the goal of authenticity, it is necessary to
create items that may be dependent. To meet the goal of mirroring classroom instruction,
it is necessary to create items that integrate different content areas in the assessment.
Content integration, in particular, encourages students to draw on a variety of skills
learned either from their curriculum or from prior knowledge. Such performance assessment
is likely to face greater psychometric challenges than multiple-choice items. It is
therefore essential to weigh the flexibility offered by performance assessment against any
potential negative psychometric consequences.

The effect of content integration on the construct validity of reading assessment
was examined in this study. Item factor analysis and item-test correlation analysis were
conducted to examine the similarity of the constructs measured by the integrated and
non-integrated items. The results from the exploratory factor analysis indicate that the
test is essentially unidimensional (Stout, 1990). The size of the factors after the first,
as opposed to their statistical significance, is one major indication. The vast
preponderance of variance and covariance is associated with the first factor, while the
remaining factors explain very little of the total variance.

The confirmatory factor analysis indicates statistically significant task-related
factors, orthogonal to the general factor. However, there is
clear evidence that the integrated items measure predominantly the same construct as the
non-integrated items. The analysis based on item-test correlations gives additional support
for this conclusion.

The bi-factor model indicates that there is a general latent dimension and two
minor task-related dimensions. The general factor supports the existence of a dominant
dimension. There are clearly task-related factors. However, whether the minor factors
account for much of the observed covariance among the items requires further
investigation. Because the main purpose of this study was not to examine the latent
structure of the reading items in question, this issue was left open; further analysis of
the relationship between the major reading factor and the minor task factors may shed
some light on it.

Stout (1990) establishes that "essential unidimensionality" and "essential
independence" are sufficient to justify unidimensional scoring. Essential independence
differs from local independence in that small, specific factors may be present but as the 
number of items becomes large, the average residual covariance from a one-factor model 
approaches zero. We found that the average residual covariance from a one-factor model 
was close to 0, indicating that essential independence appears to hold for the current data 
sets. 
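
The average residual covariance referred to above can be sketched as follows. This is a minimal illustration, assuming that an inter-item correlation (or covariance) matrix and a set of one-factor loadings are already available from a factor analysis; the function name and inputs are hypothetical, and this is not the software used in this study. Because signed residuals can cancel one another, the sketch also reports the mean absolute residual.

```python
def mean_residual_covariance(corr, loadings):
    """Average off-diagonal residual after removing one factor.

    corr[i][j] is the observed association between items i and j;
    loadings[i] is item i's loading on the single factor, so the
    model-implied association for a pair is loadings[i] * loadings[j].
    Returns (signed mean, mean absolute value) of the residuals.
    """
    n = len(loadings)
    resid = [corr[i][j] - loadings[i] * loadings[j]
             for i in range(n) for j in range(n) if i != j]
    mean_signed = sum(resid) / len(resid)
    mean_abs = sum(abs(r) for r in resid) / len(resid)
    return mean_signed, mean_abs
```

Values of both quantities near zero are in line with the essential independence described above; a signed mean near zero paired with a clearly larger absolute mean would suggest that positive and negative residuals are offsetting each other.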

Is it meaningful to combine scores from integrated and non-integrated items to
yield a single reported score on reading? The evidence indicates that the scores
may be combined, especially when IRT weights are used. Our analysis showed that the
integrated items fit the 2-parameter partial credit model as well as the non-integrated
items did. The existence of the minor task-related factors seems to cause some Cross-Task
LID; however, the problem does not appear to have negative measurement consequences.
For example, items 6 and 7 in Form A, although they showed large standardized residuals
from the one-factor model, fit the 2-parameter partial credit model very well.
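
For reference, the category probabilities of a two-parameter partial credit item can be sketched as below. The sketch assumes the model can be written in the generalized partial credit form, with one slope per item and one step parameter per score transition; the function names, parameter names, and any example values are hypothetical and do not come from the MSPAP calibration.

```python
from math import exp

def category_probabilities(theta, slope, steps):
    """P(score = 0..m) for one item at ability theta.

    steps[v - 1] is the (assumed) step parameter for moving from
    score v - 1 to score v; slope is the item's discrimination.
    """
    logits = [0.0]                         # score 0 is the reference category
    for b in steps:
        logits.append(logits[-1] + slope * (theta - b))
    top = max(logits)                      # subtract the max for numerical stability
    weights = [exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, slope, steps):
    """Model-expected item score at ability theta."""
    probs = category_probabilities(theta, slope, steps)
    return sum(k * p for k, p in enumerate(probs))
```

Item fit is then a matter of comparing observed item scores with `expected_score` across the ability range, which is the sense in which an item can fit well even when its residuals correlate with those of another item.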

The Within-Task LID, although present, is no more problematic than in assessments
based on multiple-choice items (Yen, 1993). For the MSPAP reading assessment, efforts
were made to write items that elicit student responses that are as independent as
possible within a task. It is evident that this effort resulted in minimal LID within tasks.
The instances of Within-Task LID that we found are all due to Item-Chaining. However,
Item-Chaining is desirable in performance assessment because it models real-life situations.

Most of the LID problems seem to be due to the communality across tasks. This
finding is consistent with the findings from the item factor analysis. There seem to be
task-specific factors being measured, orthogonal to the general reading factor. In
particular, ‘Performing a Task’ seems to involve specific cognitive skills that are above
and beyond the general reading dimension measured by all the items. The cognitive
demand for ‘Reading for Information’ similarly involves some unique cognitive skills
that are beyond the general reading dimension. Since these additional factors may be
desirable because they mirror classroom instructional practice, further studies of the
cognitive skills that are required in answering the integrated versus non-integrated items
may facilitate a deeper understanding of students’ learning.

Limitations of the Study 

It would be valuable to examine the performance of LID measures other than the
Q3 statistic. Measures that are less affected by the item scores would be more
desirable. For a relatively short test such as ours, substantial negative Q3 values are
expected due to part-whole contamination. For example, if a student scores relatively
higher on a non-integrated item than on an integrated item, the student’s estimated true
score will be the average of the two item scores. The student’s expected performance
would then be too low for the non-integrated item and too high for the integrated item.
This may explain the large number of Cross-Task LID pairs found in this study.
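
The size of this negative bias can be illustrated with a short calculation. Yen (1993) notes that when local independence holds, the expected value of Q3 is approximately -1/(n - 1) for a test of n items. Taking n = 13, the number of items per form implied by the item numbering in Table 4 (an assumption about how the forms were calibrated):

```latex
E(Q_3) \approx -\frac{1}{n - 1} = -\frac{1}{13 - 1} \approx -0.083
```

Under this approximation, Q3 values near -0.1 would be unremarkable even in the absence of any dependence, so only values well below that baseline would point to Cross-Task LID.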

Although the results suggest that Cross-Task LID did not result in poorly fitting items
when IRT calibration was applied, the impact of Cross-Task LID on test information and
the standard error of measurement (SEM) needs further investigation. Thissen, Steinberg, and
Mooney (1989) and Sireci, Thissen, and Wainer (1991) pointed out that an inappropriate
assumption of local item independence produces overestimates of test information and
reliability and underestimates of SEM. They recommended the use of testlets. One
reason testlets are not currently used in MSPAP is that items can form a testlet only if
they belong to the same passage or task. In addition, when testlets are formed, the items
contributing to the testlet no longer remain as separate items in the total test scores. In
an assessment where reporting outcome scores is essential, the use of testlets
diminishes the meaning of the outcome scores. Other innovative ways of removing
Cross-Task LID should be researched further to provide an alternative to testlets.
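
To make the trade-off concrete, forming testlets amounts to replacing the items of each task with a single summed score. The following is a minimal sketch, assuming item scores are stored per examinee and each item is labeled with its task; the function name and the task labels are hypothetical.

```python
def form_testlets(scores, task_of_item):
    """Collapse item scores into one summed score per task.

    scores[k][i] is examinee k's score on item i; task_of_item[i] is the
    label of the task (or passage) that item i belongs to.
    Returns the sorted task labels and, per examinee, one score per task.
    """
    tasks = sorted(set(task_of_item))
    testlet_scores = []
    for row in scores:
        testlet_scores.append(
            [sum(s for s, t in zip(row, task_of_item) if t == task)
             for task in tasks])
    return tasks, testlet_scores
```

After this step a calibration would see one polytomous score per task, so the item-level outcome scores mentioned above are no longer separately available. The sketch also shows why testlets do not address Cross-Task LID directly: dependence between items in different tasks becomes dependence between the summed task scores.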

Content integration often involves items that measure a range of different
content. It is meant to encourage students to draw on a variety of skills they learned in
other content areas, from their curriculum, or from their prior knowledge. However, one
negative consequence is the potential for LID when items measure unique content. When
performance is differentially affected by exposure in the curriculum or by prior knowledge,
items may show LID (Yen, 1993). It would be valuable to examine whether such items
show differential item functioning as well.

These additional content-related factors, while desirable because they measure
important dimensions in the context of integrated assessment, may represent
nuisance dimensions that the test did not purport to measure. Shealy and Stout (1996)
conceptualize differential item functioning (DIF) as the multidimensional influences
on item responses that render a test less valid for one group of examinees than for
another. They use the terms target ability, the latent trait that the test is intended
to measure, and nuisance determinants, the unintended traits that influence some
portion of the examinees’ responses to test items. Therefore, it is essential to investigate
DIF in an effort to determine the degree of influence of ‘nuisance determinants’ on
responses to the test items. For example, it would be valuable to identify whether the
nuisance determinants are related to Content-Based LID and whether the DIF is
expressed in one or more items whose responses depend on the content.